Now that the 2020 election is officially over and Biden has been elected President of the United States, it is important that I reflect on my prediction model. I am excited to see what I can learn from it for the models I build in the future.
Let’s first recap my prediction model to get a better picture of what it was.
My prediction model was an ensemble model that predicted the popular vote share for each state.
Given that the Time For Change Model was an inspiration, I decided to focus my model on historical Republican vote share, since Trump was the incumbent in the 2020 election and incumbency is a predictor in the Time For Change Model.
I decided to separate America into three categories (red states, blue states, and battleground states) so that my model could adjust for overfitting. The groupings were based on how FiveThirtyEight grouped states.
My model used the following data:
In my model, I decided to classify approval, Q2 GDP growth, and turnout as fundamentals.
Thus, my ensemble model weighted the poll model (using only polls) by 0.96 and the fundamentals model (using only fundamentals) by 0.04. I chose these weights based on FiveThirtyEight’s reasoning that polls become better predictors as the election nears, while fundamentals become noisier.
My final prediction using the ensemble model was that Biden would win 310 electoral votes and Trump would win 228, making Biden the president-elect of the United States.
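The weighting described above can be sketched in a few lines of Python. This is only an illustration, not the code behind my forecast: the function name `ensemble_share` and the two input predictions are made up, and only the 0.96 and 0.04 weights come from the model itself.

```python
# Weights taken from the text; everything else here is illustrative.
POLL_WEIGHT = 0.96
FUNDAMENTALS_WEIGHT = 0.04


def ensemble_share(poll_pred, fundamentals_pred,
                   w_poll=POLL_WEIGHT, w_fund=FUNDAMENTALS_WEIGHT):
    """Weighted average of two vote-share predictions (in percent)."""
    if abs(w_poll + w_fund - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return w_poll * poll_pred + w_fund * fundamentals_pred


# Made-up inputs: the poll model says 48.0%, the fundamentals model says 52.0%.
print(round(ensemble_share(48.0, 52.0), 2))
```

With these made-up inputs the combined share is 48.16, which shows how little the fundamentals model can move the final number when it only carries 4% of the weight.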
Overall, I am pretty satisfied with how my model turned out. This was my first election forecast, and while I did miss a few states, I was quite happy that I predicted some battleground states correctly.
Above is a comparison between my predictions and the actual results of the 2020 election. As you can see, the states that I got wrong were battleground states. However, I would note that the prediction intervals for the battleground states did capture the true result.
Moreover, let’s take a look at the plot above, which plots the actual two-party vote share for Trump against my predictions for Trump. The blue points represent states Biden won and the red points represent states Trump won.
Furthermore, the map above shows the difference between Trump’s actual and predicted two-party vote share in each state. A negative difference means that Trump was overpredicted for that particular state, while a positive difference means that Trump was underpredicted for that particular state.
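Checking whether a prediction interval captured the true result is a simple comparison, sketched below in Python. The helper names and the three rows of numbers are hypothetical placeholders, not my actual intervals or the real 2020 results.

```python
def interval_covers(actual, lower, upper):
    """True if the actual vote share falls inside the prediction interval."""
    return lower <= actual <= upper


def coverage_rate(rows):
    """Fraction of (actual, lower, upper) rows whose interval covers the actual value."""
    if not rows:
        raise ValueError("need at least one row")
    hits = sum(1 for actual, lower, upper in rows
               if interval_covers(actual, lower, upper))
    return hits / len(rows)


# Hypothetical (actual, lower bound, upper bound) values in percent.
hypothetical_rows = [
    (49.3, 46.0, 52.0),  # covered
    (51.1, 47.5, 50.5),  # missed: actual is above the upper bound
    (47.8, 45.0, 51.0),  # covered
]
print(coverage_rate(hypothetical_rows))
```

A coverage rate well below the nominal level of the intervals would suggest they were too narrow, even when the point predictions look close.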
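The sign convention above (actual minus predicted) can be made concrete with a short Python sketch. The state labels and vote shares are invented for illustration, and the RMSE summary is an extra I am adding here as an assumption about how one might summarize the errors; it is not a number reported in this post.

```python
import math


def vote_share_error(actual, predicted):
    """Actual minus predicted two-party vote share (percentage points)."""
    return actual - predicted


def describe_error(diff):
    """Negative difference: overpredicted. Positive difference: underpredicted."""
    if diff < 0:
        return "overpredicted"
    if diff > 0:
        return "underpredicted"
    return "exact"


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Hypothetical (actual, predicted) Trump two-party shares in percent.
hypothetical = {"State A": (51.0, 49.5), "State B": (47.2, 48.9)}

for state, (actual, predicted) in hypothetical.items():
    diff = vote_share_error(actual, predicted)
    print(state, round(diff, 1), describe_error(diff))

print("RMSE:", round(rmse(list(hypothetical.values())), 2))
```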
Now that we have gone over my prediction model, it is important to look at possible hypotheses for the inaccuracies seen in my model. My model seemed to incorrectly predict the results for battleground states in particular, and it is important that we pay attention to the reasons why. Below are my hypotheses for explaining the overall inaccuracies of my model:
To test the first hypothesis that states and counties have partisan shifts (which can impact the accuracy of my model), I can look at recent voting trends in such states and counties. These are likely to be battleground states and the counties within them. We can look at how states and counties voted in the 2016 presidential election, the 2018 midterm election, and the 2020 election, and analyze any trends with regressions and correlations. If certain states and counties are shifting blue or red, that is something to take note of. One example of a trend that we may see is how southern Texas counties have been voting more red over time compared to the 2008 election.
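A first pass at spotting these shifts could look like the Python sketch below. It is a simplification under stated assumptions: the county names and Republican two-party shares are invented, and it treats the 2018 midterm as directly comparable to the presidential years, which a real analysis would need to justify.

```python
def election_shifts(shares_by_year):
    """Change in Republican two-party share between consecutive elections."""
    years = sorted(shares_by_year)
    return {(a, b): shares_by_year[b] - shares_by_year[a]
            for a, b in zip(years, years[1:])}


def trend(shares_by_year):
    """Label a place by the direction of all its consecutive shifts."""
    changes = list(election_shifts(shares_by_year).values())
    if changes and all(c > 0 for c in changes):
        return "shifting red"
    if changes and all(c < 0 for c in changes):
        return "shifting blue"
    return "mixed"


# Hypothetical Republican two-party shares (percent) by election year.
hypothetical_counties = {
    "County A": {2016: 40.0, 2018: 42.0, 2020: 45.0},
    "County B": {2016: 55.0, 2018: 52.5, 2020: 50.0},
    "County C": {2016: 48.0, 2018: 50.0, 2020: 49.0},
}

for county, shares in hypothetical_counties.items():
    print(county, trend(shares))
```

Counties flagged as consistently shifting in one direction would be the natural candidates for the regressions and correlations mentioned above.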
To test the second hypothesis, we can run a linear regression model between the popular vote share for a presidential candidate (say the incumbent) and the change in turnout for different demographics. Through this model, we may get a better idea not only of how changes in turnout rates among different demographics affect election forecasting but also of how they affect the Democratic or Republican popular vote share. Furthermore, it may also make sense to run the model on a per-county basis, since the 2020 results suggest that certain counties see greater turnout rates from particular demographics than other counties.
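As a minimal sketch of that regression, the Python below fits a separate one-variable least-squares line for each demographic group. This is a simplification of the idea (a full model would likely put all groups into one multiple regression), and the group labels and numbers are invented for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: change in turnout (percentage points) for two groups
# across five counties, and the incumbent's vote share in those counties.
hypothetical_turnout_change = {
    "Group A": [1.0, 2.5, 4.0, 0.5, 3.0],
    "Group B": [2.0, 1.0, 0.0, 3.5, 1.5],
}
incumbent_share = [47.0, 45.5, 43.0, 48.5, 44.0]

for group, change in hypothetical_turnout_change.items():
    slope, intercept = fit_line(change, incumbent_share)
    print(group, "slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

A negative slope for a group would suggest that higher turnout from that group goes with a lower incumbent share in this toy data; with real data, the per-county version would need many more observations and some care about confounding.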
To test the third hypothesis, one test that can be used is to create a predictive linear model for a candidate’s popular vote share using only recent polls. Given that historical polls may not be as predictive for today’s elections, it may make sense to only use recent polls, such as those from 2016 onward. This might be because there has never really been a president with the character of Trump, and so there may be non-response bias among Republicans, as some Republicans may be afraid to tell pollsters that they will vote for Trump. Moreover, it would also make sense to use polls that are high in quality, which can be measured using FiveThirtyEight poll grades. My prediction model did not filter for high-quality polls, and poll quality may affect forecasting results because higher-quality polls may be more representative of society. I would also mention that polls need to do a better job of reaching hard-to-reach demographics like Hispanic Americans, and so I would be interested in using more polls that target these demographics in predictive models.
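Filtering polls by recency and grade before modeling could be sketched as follows in Python. The cutoffs (2016 onward, grade B or better), the `Poll` fields, and the pollster names are all assumptions for illustration. The sketch only handles plain letter grades and drops anything else, which is also an assumption.

```python
from dataclasses import dataclass

# Plain letter grades from best to worst; other grade labels are excluded.
GRADE_ORDER = ["A+", "A", "A-", "B+", "B", "B-",
               "C+", "C", "C-", "D+", "D", "D-", "F"]


@dataclass
class Poll:
    pollster: str       # hypothetical pollster name
    year: int           # year the poll was conducted
    grade: str          # pollster letter grade
    trump_share: float  # Trump's share in the poll (percent)


def keep_poll(poll, min_year=2016, min_grade="B"):
    """Keep polls that are recent enough and graded at least min_grade."""
    if poll.grade not in GRADE_ORDER:
        return False
    good_grade = GRADE_ORDER.index(poll.grade) <= GRADE_ORDER.index(min_grade)
    return poll.year >= min_year and good_grade


def filtered_average(polls, min_year=2016, min_grade="B"):
    """Average Trump share over the polls that pass the filter, or None."""
    kept = [p for p in polls if keep_poll(p, min_year, min_grade)]
    if not kept:
        return None
    return sum(p.trump_share for p in kept) / len(kept)


hypothetical_polls = [
    Poll("Pollster X", 2020, "A", 48.0),
    Poll("Pollster Y", 2012, "A+", 40.0),  # dropped: too old
    Poll("Pollster Z", 2020, "C", 60.0),   # dropped: grade too low
    Poll("Pollster W", 2016, "B", 50.0),
]
print(filtered_average(hypothetical_polls))
```

The filtered polls, rather than their simple average, would then feed the predictive linear model described above.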
To test the fourth hypothesis, we can use my prediction model but exclude economic predictors from the fundamentals. As mentioned before, the 2020 economy was an anomaly, so it is best not to use economic predictors for this election. It would still be interesting to use economic predictors for the 2024 election and other future elections, provided that the economic indicators during those elections are not all over the place. If we do use economic predictors for those future elections, it makes sense to leave out the 2020 economic variables.
Now that we have a better understanding of my model and where it went wrong, the following are changes I would like to make to my model:
I would use recent polling (with high grades from FiveThirtyEight) instead of historical polling in my model. This is because many states have recently seen partisan shifts in vote share, and so historical polling may not be that helpful in predicting the 2020 election. Additionally, the 2020 election map did not differ greatly from 2016, so using recent polling may be more accurate for forecasting. Plus, polls with high grades are likely higher in quality and thus more representative of society.
Instead of accounting for overall change in turnout rate, it may make sense to focus on expected change in turnout rate for particular demographics, such as Hispanics and African Americans. This is because many battleground states were determined by the turnout from these demographic groups, as evidenced by the Cuban vote in Miami-Dade County and the African American vote in Fulton County.
I would not use Q2 GDP growth rate or any economic predictor for this model because the 2020 economy was an anomaly, and so historical economic predictors may not be good for predicting the 2020 election.
I would also like to potentially make a prediction model that generates county-level predictions instead of state-level predictions. The results of many states appear to be determined by specific counties, so it would be worthwhile to see whether a county-level model is more accurate.
I really enjoyed making my prediction model and learning from it. I now have a better grasp of how to build predictive models, and I am eager to see how future elections will differ from the 2020 election. I want to thank my teaching fellow for Gov 1347, Sun Young Park, as well as Professor Ryan D. Enos and Soubhik Barari, for their teaching.